Sunday, August 10. 2008Build Median Aggregate Function in SQLPrinter FriendlyComments
Display comments as
(Linear | Threaded)
Uh, you don't need multiple implementations if you use polymorphism.
CREATE OR REPLACE FUNCTION array_median(anyarray) RETURNS anyelement AS $$ SELECT CASE WHEN array_upper($1,1) = 0 THEN null ELSE asorted[ceiling(array_upper(asorted,1)/2.0)] END FROM (SELECT ARRAY(SELECT ($1)[n] FROM generate_series(1, array_upper($1, 1)) AS n WHERE ($1)[n] IS NOT NULL ORDER BY ($1)[n] ) As asorted) As foo ; $$ LANGUAGE 'sql' IMMUTABLE; CREATE AGGREGATE median(anyelement) ( SFUNC=array_append, STYPE=anyarray, FINALFUNC=array_median );
This is example, where aggregate function is relative slow.
postgres=# select median(a) from foo; median -------- 503 (1 row) Time: 90,089 ms postgres=# select array_median(array(select a from foo order by 1)); array_median -------------- 503 (1 row) Time: 30,101 ms
Pavel,
This is very interesting. I thought your example difference was because of the presorting, but evidentally the aggregation process is adding a lot of overhead as well. What could that be? I assumed there was a benefit to presorting especially if you have an index on the presorted column, but strangely I am not seeing the benefit in my sample data. I have an index on the table on the length column census2000tiger_arc. Looking at these 3 - I would have expected the sorted ones to perform better, but it doesn't by much if at all. Looking at the explain I realize its not using the length index I put in but instead the zip index. So I'm guessing presorting only helps if your data has an index on that field and that index is actually used. --This one returns in 2907 ms -- midlength = 134.34012, count = 19,084 (without count 2891 ms) SELECT median(length) as midlength, count(zipl) from census2000tiger_arc where zipl between 2000 and 2109; --This one returns in 2964 ms (without count 2953 ms) --midlength = 134.34012, count = 19,084 SELECT median(length) as midlength, count(zipl) from (SELECT * FROM census2000tiger_arc where zipl between 2000 and 2109 ORDER BY length) as foo; --Equivalent to yours - 1844 ms, result = 134.34012 select array_median(array(select length FROM census2000tiger_arc where zipl between 2000 and 2109 ORDER BY length));
Hello,
simply repeated calling array_append function is expensive - this function isn't optimalised for using as aggregate. Array(subselect) is much faster.
There is an error with this algorithm, as it doesn't consider the odd/even rule in the calculation. Although not a major error, the results using this function are different than all other implementations of the median function (e.g., R, Excel, etc.).
When there is an even number of items, the median is the average of the inner two middle values (since there is no symmetry to provide a middle value). If there is an odd number of items, the median is the middle index (as applied above). The examples above have odd set counts, so they work as expected. However, even set count examples fail with a "left" bias. Building on Tom's suggestion above, consider using the following (note that this only works for numeric-like inputs, or other types that can be divided by two): CREATE OR REPLACE FUNCTION array_median(anyarray) RETURNS anyelement AS $$ SELECT CASE WHEN array_upper($1,1) = 0 THEN null WHEN mod(array_upper($1,1),2) = 1 THEN asorted[ceiling(array_upper(asorted,1)/2.0)] ELSE ((asorted[ceiling(array_upper(asorted,1)/2.0)] + asorted[ceiling(array_upper(asorted,1)/2.0)+1])/2.0) END FROM (SELECT ARRAY(SELECT ($1)[n] FROM generate_series(1, array_upper($1, 1)) AS n WHERE ($1)[n] IS NOT NULL ORDER BY ($1)[n] ) As asorted) As foo ; $$ LANGUAGE 'sql' IMMUTABLE;
Mike,
Thanks for the example. Yes I was thinking about providing something like that and ideally for this one, I think you would use numeric[], numeric instead of anyelement. As I recall things working you can have both definitions of median working and the one that is the closest match to the function is the one that gets used. So if you define this - then it would get used for numerics and the other would get used for everything else. I'll have to try that though since I'm not absolutely sure that's how it works.
I tried to create median aggregate, but it gave me an error:
ERROR: syntax error at or near "(" SQL state: 42601 Character: 36 Thanks
Kursat,
Which version of PostgreSQL are you running. use SELECT version(); pre 8.0 versions don't support $$ quoting. The other possibility is just a cut and paste typo somewhere.
Pavel, thank you for posting this. Your method made a dramatic difference in the performance of medians.
When using the r_median() function from PL/R combined with your method I am getting back results in 1-2s on a table where using a MEDIAN AGGREGATE was taking literally 30 minutes. Array mutating is absolutely terrible in Postgres and if you have a large amount of rows, using array_append (or similar things, like R's plr_array_accum) get exponentially worse as your result set increases. I am disappointed that Arrays are handled so poorly in Postgres. It basically completely rules out using the CREATE AGGREGATE construct with holistic aggregates.
Dimensionally -- have you tried PostgreSQL 8.4 beta. One of the major improvements is significantly speeding up the array accumulation in aggregates so that you get the benefits of what Pavel described, and the elegance of an aggregate function.
Thank you Regina, using the array_agg aggregate is a much nicer building block.
Am I correct in understanding that the syntax I need to use is: SELECT r_median(array_agg(myColumn)) FROM myTable That is what I am using, and it is working efficiently. However, I would rather be able to do: SELECT median(myColumn) FROM myTable But I am happy with the former. thank you
That is one way. I haven't experimented with it myself. I would more likely go the direction of still making a median aggregate function but using array_agg helper functions. If you look at the definition of array_agg -- it looks like this
CREATE AGGREGATE array_agg(anyelement) ( SFUNC=array_agg_transfn, STYPE=internal, FINALFUNC=array_agg_finalfn ); So I would use the array_agg_transfn/finalfn etc to try to roll my own faster aggregates. We are planning to experiment with this in another article and see how it goes.
Hi, thank you for the additional information.
However, it does not appear to be legal for user-SQL to contain a STYPE of "internal". I get this error: "ERROR: aggregate transition data type cannot be internal"
Dimensionally,
We haven't gotten around to it yet but hope to soon to demonstrate. In the meantime, take a look at the /share/contrib/int_aggregate.sql of your PostgreSQL 8.4 install. It demonstrates how to use the new agg transition functions and the unnest function to build a custom aggregate.
Okay I looked into this further and sadly I think I was wrong. The int_agg example is nothing but a thin wrapper around the existing array_agg so it seems that to do what I really wanted to do, would require writing C code function for the last step. Its too bad the CREATE AGGREGATE doesn't have another arg like POST CONDITION or something that would allow you to further massage the INT type. That would make this way more elegant and useful.
Your implementation has a bug, in that sets of even numbers don't return the average of the two middle numbers. You can fix it by adding cases for checking if the size of an array is even or odd.
Here's a fix. I used asorted for all the case checks because you could otherwise be doing arithmetic operations on null values, which is never good. CREATE OR REPLACE FUNCTION array_median(numeric[]) RETURNS numeric AS $$ SELECT CASE WHEN array_upper(asorted, 1) = 0 THEN null WHEN array_upper(asorted, 1) % 2 = 1 then asorted[ceiling(array_upper(asorted,1)/2.0)] ELSE (asorted[ceiling(array_upper(asorted,1)) / 2.0] + asorted[ceiling(array_upper(asorted,1)) / 2.0 + 1]) / 2.0 END FROM ( SELECT ARRAY ( SELECT ($1)[n] FROM generate_series(1, array_upper($1, 1)) AS n WHERE ($1)[n] IS NOT NULL ORDER BY ($1)[n] ) as asorted ) as foo ; $$ LANGUAGE 'sql' IMMUTABLE;
This is not working anymore with pg 8.4.1 and postgis 1.4.1
Many errors found in geomunion etc. Need to be refreshed
Bruno,
What does this article have to do with PostGIS? Please don't use geomunion anymore. Its been deprecated since PostGIS 1.2 and was not improved on in 1.4. It will be destroyed in PostGIS 2.0. You should be using ST_Union instead which has the faster Cascade Union functionality. |
Entry's LinksQuicksearchCalendar
Categories
Blog Administration |
Microsoft Access has these peculiar set of aggregates called First and Last. We try to avoid them because while the concept is useful, we find Microsoft Access's implementation of them a bit broken. MS Access power users we know moving over to somethin
Tracked: Aug 12, 22:58
PostgreSQL 8.2 and above has this pretty neat feature of allowing you to define aggregate functions that take more than one column as an input. First we'll start off with a rather pointless but easy to relate to example and then we'll follow up with som
Tracked: Mar 05, 22:56